{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Audio Data Augmentation\n\n``torchaudio`` provides a variety of ways to augment audio data.\n\nIn this tutorial, we look into a way to apply effects, filters,\nRIR (room impulse response) and codecs.\n\nAt the end, we synthesize noisy speech over phone from clean speech.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import torch\nimport torchaudio\nimport torchaudio.functional as F\n\nprint(torch.__version__)\nprint(torchaudio.__version__)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Preparation\n\nFirst, we import the modules and download the audio assets we use in this tutorial.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import math\n\nfrom IPython.display import Audio\nimport matplotlib.pyplot as plt\n\nfrom torchaudio.utils import download_asset\n\nSAMPLE_WAV = download_asset(\"tutorial-assets/steam-train-whistle-daniel_simon.wav\")\nSAMPLE_RIR = download_asset(\"tutorial-assets/Lab41-SRI-VOiCES-rm1-impulse-mc01-stu-clo-8000hz.wav\")\nSAMPLE_SPEECH = download_asset(\"tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042-8000hz.wav\")\nSAMPLE_NOISE = download_asset(\"tutorial-assets/Lab41-SRI-VOiCES-rm1-babb-mc01-stu-clo-8000hz.wav\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Applying effects and filtering\n\n:py:func:`torchaudio.sox_effects` allows for directly applying filters similar to\nthose available in ``sox`` to Tensor objects and file object audio sources.\n\nThere are two functions for this:\n\n-  :py:func:`torchaudio.sox_effects.apply_effects_tensor` for applying effects\n   to Tensor.\n-  :py:func:`torchaudio.sox_effects.apply_effects_file` for applying effects to\n   other audio sources.\n\nBoth functions accept effect definitions in the form\n``List[List[str]]``.\nThis is mostly consistent with how ``sox`` command works, but one caveat is\nthat ``sox`` adds some effects automatically, whereas ``torchaudio``\u2019s\nimplementation does not.\n\nFor the list of available effects, please refer to [the sox\ndocumentation](http://sox.sourceforge.net/sox.html)_.\n\n**Tip** If you need to load and resample your audio data on the fly,\nthen you can use :py:func:`torchaudio.sox_effects.apply_effects_file`\nwith effect ``\"rate\"``.\n\n**Note** :py:func:`torchaudio.sox_effects.apply_effects_file` accepts a\nfile-like object or path-like object.\nSimilar to :py:func:`torchaudio.load`, when the audio format cannot be\ninferred from either the file extension or header, you can provide\nargument ``format`` to specify the format of the audio source.\n\n**Note** This process is not differentiable.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Load the data\nwaveform1, sample_rate1 = torchaudio.load(SAMPLE_WAV)\n\n# Define effects\neffects = [\n    [\"lowpass\", \"-1\", \"300\"],  # apply single-pole lowpass filter\n    [\"speed\", \"0.8\"],  # reduce the speed\n    # This only changes sample rate, so it is necessary to\n    # add `rate` effect with original sample rate after this.\n    [\"rate\", f\"{sample_rate1}\"],\n    [\"reverb\", \"-w\"],  # Reverbration gives some dramatic feeling\n]\n\n# Apply effects\nwaveform2, sample_rate2 = torchaudio.sox_effects.apply_effects_tensor(waveform1, sample_rate1, effects)\n\nprint(waveform1.shape, sample_rate1)\nprint(waveform2.shape, sample_rate2)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Note that the number of frames and number of channels are different from\nthose of the original after the effects are applied. Let\u2019s listen to the\naudio.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def plot_waveform(waveform, sample_rate, title=\"Waveform\", xlim=None):\n    waveform = waveform.numpy()\n\n    num_channels, num_frames = waveform.shape\n    time_axis = torch.arange(0, num_frames) / sample_rate\n\n    figure, axes = plt.subplots(num_channels, 1)\n    if num_channels == 1:\n        axes = [axes]\n    for c in range(num_channels):\n        axes[c].plot(time_axis, waveform[c], linewidth=1)\n        axes[c].grid(True)\n        if num_channels > 1:\n            axes[c].set_ylabel(f\"Channel {c+1}\")\n        if xlim:\n            axes[c].set_xlim(xlim)\n    figure.suptitle(title)\n    plt.show(block=False)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def plot_specgram(waveform, sample_rate, title=\"Spectrogram\", xlim=None):\n    waveform = waveform.numpy()\n\n    num_channels, _ = waveform.shape\n\n    figure, axes = plt.subplots(num_channels, 1)\n    if num_channels == 1:\n        axes = [axes]\n    for c in range(num_channels):\n        axes[c].specgram(waveform[c], Fs=sample_rate)\n        if num_channels > 1:\n            axes[c].set_ylabel(f\"Channel {c+1}\")\n        if xlim:\n            axes[c].set_xlim(xlim)\n    figure.suptitle(title)\n    plt.show(block=False)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Original:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "plot_waveform(waveform1, sample_rate1, title=\"Original\", xlim=(-0.1, 3.2))\nplot_specgram(waveform1, sample_rate1, title=\"Original\", xlim=(0, 3.04))\nAudio(waveform1, rate=sample_rate1)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Effects applied:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "plot_waveform(waveform2, sample_rate2, title=\"Effects Applied\", xlim=(-0.1, 3.2))\nplot_specgram(waveform2, sample_rate2, title=\"Effects Applied\", xlim=(0, 3.04))\nAudio(waveform2, rate=sample_rate2)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Doesn\u2019t it sound more dramatic?\n\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Simulating room reverberation\n\n[Convolution\nreverb](https://en.wikipedia.org/wiki/Convolution_reverb)_ is a\ntechnique that's used to make clean audio sound as though it has been\nproduced in a different environment.\n\nUsing Room Impulse Response (RIR), for instance, we can make clean speech\nsound as though it has been uttered in a conference room.\n\nFor this process, we need RIR data. The following data are from the VOiCES\ndataset, but you can record your own \u2014 just turn on your microphone\nand clap your hands.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "rir_raw, sample_rate = torchaudio.load(SAMPLE_RIR)\nplot_waveform(rir_raw, sample_rate, title=\"Room Impulse Response (raw)\")\nplot_specgram(rir_raw, sample_rate, title=\"Room Impulse Response (raw)\")\nAudio(rir_raw, rate=sample_rate)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "First, we need to clean up the RIR. We extract the main impulse, normalize\nthe signal power, then flip along the time axis.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "rir = rir_raw[:, int(sample_rate * 1.01) : int(sample_rate * 1.3)]\nrir = rir / torch.norm(rir, p=2)\nRIR = torch.flip(rir, [1])\n\nplot_waveform(rir, sample_rate, title=\"Room Impulse Response\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Then, we convolve the speech signal with the RIR filter.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "speech, _ = torchaudio.load(SAMPLE_SPEECH)\n\nspeech_ = torch.nn.functional.pad(speech, (RIR.shape[1] - 1, 0))\naugmented = torch.nn.functional.conv1d(speech_[None, ...], RIR[None, ...])[0]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Original:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "plot_waveform(speech, sample_rate, title=\"Original\")\nplot_specgram(speech, sample_rate, title=\"Original\")\nAudio(speech, rate=sample_rate)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### RIR applied:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "plot_waveform(augmented, sample_rate, title=\"RIR Applied\")\nplot_specgram(augmented, sample_rate, title=\"RIR Applied\")\nAudio(augmented, rate=sample_rate)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Adding background noise\n\nTo add background noise to audio data, you can simply add a noise Tensor to\nthe Tensor representing the audio data. A common method to adjust the\nintensity of noise is changing the Signal-to-Noise Ratio (SNR).\n[[wikipedia](https://en.wikipedia.org/wiki/Signal-to-noise_ratio)_]\n\n$$ \\\\mathrm{SNR} = \\\\frac{P_{signal}}{P_{noise}} $$\n\n$$ \\\\mathrm{SNR_{dB}} = 10 \\\\log _{{10}} \\\\mathrm {SNR} $$\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "speech, _ = torchaudio.load(SAMPLE_SPEECH)\nnoise, _ = torchaudio.load(SAMPLE_NOISE)\nnoise = noise[:, : speech.shape[1]]\n\nspeech_rms = speech.norm(p=2)\nnoise_rms = noise.norm(p=2)\n\nsnr_dbs = [20, 10, 3]\nnoisy_speeches = []\nfor snr_db in snr_dbs:\n    snr = 10 ** (snr_db / 20)\n    scale = snr * noise_rms / speech_rms\n    noisy_speeches.append((scale * speech + noise) / 2)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Background noise:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "plot_waveform(noise, sample_rate, title=\"Background noise\")\nplot_specgram(noise, sample_rate, title=\"Background noise\")\nAudio(noise, rate=sample_rate)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### SNR 20 dB:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "snr_db, noisy_speech = snr_dbs[0], noisy_speeches[0]\nplot_waveform(noisy_speech, sample_rate, title=f\"SNR: {snr_db} [dB]\")\nplot_specgram(noisy_speech, sample_rate, title=f\"SNR: {snr_db} [dB]\")\nAudio(noisy_speech, rate=sample_rate)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### SNR 10 dB:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "snr_db, noisy_speech = snr_dbs[1], noisy_speeches[1]\nplot_waveform(noisy_speech, sample_rate, title=f\"SNR: {snr_db} [dB]\")\nplot_specgram(noisy_speech, sample_rate, title=f\"SNR: {snr_db} [dB]\")\nAudio(noisy_speech, rate=sample_rate)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### SNR 3 dB:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "snr_db, noisy_speech = snr_dbs[2], noisy_speeches[2]\nplot_waveform(noisy_speech, sample_rate, title=f\"SNR: {snr_db} [dB]\")\nplot_specgram(noisy_speech, sample_rate, title=f\"SNR: {snr_db} [dB]\")\nAudio(noisy_speech, rate=sample_rate)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Applying codec to Tensor object\n\n:py:func:`torchaudio.functional.apply_codec` can apply codecs to\na Tensor object.\n\n**Note** This process is not differentiable.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "waveform, sample_rate = torchaudio.load(SAMPLE_SPEECH)\n\nconfigs = [\n    {\"format\": \"wav\", \"encoding\": \"ULAW\", \"bits_per_sample\": 8},\n    {\"format\": \"gsm\"},\n    {\"format\": \"vorbis\", \"compression\": -1},\n]\nwaveforms = []\nfor param in configs:\n    augmented = F.apply_codec(waveform, sample_rate, **param)\n    waveforms.append(augmented)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Original:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "plot_waveform(waveform, sample_rate, title=\"Original\")\nplot_specgram(waveform, sample_rate, title=\"Original\")\nAudio(waveform, rate=sample_rate)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 8 bit mu-law:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "plot_waveform(waveforms[0], sample_rate, title=\"8 bit mu-law\")\nplot_specgram(waveforms[0], sample_rate, title=\"8 bit mu-law\")\nAudio(waveforms[0], rate=sample_rate)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### GSM-FR:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "plot_waveform(waveforms[1], sample_rate, title=\"GSM-FR\")\nplot_specgram(waveforms[1], sample_rate, title=\"GSM-FR\")\nAudio(waveforms[1], rate=sample_rate)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Vorbis:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "plot_waveform(waveforms[2], sample_rate, title=\"Vorbis\")\nplot_specgram(waveforms[2], sample_rate, title=\"Vorbis\")\nAudio(waveforms[2], rate=sample_rate)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Simulating a phone recoding\n\nCombining the previous techniques, we can simulate audio that sounds\nlike a person talking over a phone in a echoey room with people talking\nin the background.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "sample_rate = 16000\noriginal_speech, sample_rate = torchaudio.load(SAMPLE_SPEECH)\n\nplot_specgram(original_speech, sample_rate, title=\"Original\")\n\n# Apply RIR\nspeech_ = torch.nn.functional.pad(original_speech, (RIR.shape[1] - 1, 0))\nrir_applied = torch.nn.functional.conv1d(speech_[None, ...], RIR[None, ...])[0]\n\nplot_specgram(rir_applied, sample_rate, title=\"RIR Applied\")\n\n# Add background noise\n# Because the noise is recorded in the actual environment, we consider that\n# the noise contains the acoustic feature of the environment. Therefore, we add\n# the noise after RIR application.\nnoise, _ = torchaudio.load(SAMPLE_NOISE)\nnoise = noise[:, : rir_applied.shape[1]]\n\nsnr_db = 8\nscale = (10 ** (snr_db / 20)) * noise.norm(p=2) / rir_applied.norm(p=2)\nbg_added = (scale * rir_applied + noise) / 2\n\nplot_specgram(bg_added, sample_rate, title=\"BG noise added\")\n\n# Apply filtering and change sample rate\nfiltered, sample_rate2 = torchaudio.sox_effects.apply_effects_tensor(\n    bg_added,\n    sample_rate,\n    effects=[\n        [\"lowpass\", \"4000\"],\n        [\n            \"compand\",\n            \"0.02,0.05\",\n            \"-60,-60,-30,-10,-20,-8,-5,-8,-2,-8\",\n            \"-8\",\n            \"-7\",\n            \"0.05\",\n        ],\n        [\"rate\", \"8000\"],\n    ],\n)\n\nplot_specgram(filtered, sample_rate2, title=\"Filtered\")\n\n# Apply telephony codec\ncodec_applied = F.apply_codec(filtered, sample_rate2, format=\"gsm\")\n\nplot_specgram(codec_applied, sample_rate2, title=\"GSM Codec Applied\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Original speech:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "Audio(original_speech, rate=sample_rate)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### RIR applied:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "Audio(rir_applied, rate=sample_rate)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Background noise added:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "Audio(bg_added, rate=sample_rate)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Filtered:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "Audio(filtered, rate=sample_rate2)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Codec applied:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "Audio(codec_applied, rate=sample_rate2)"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.4"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}